Clustering via centroids a bag of qualitative values and measuring its inconsistency
Abstract
It is well understood how to compute the average of a set of numeric values; thus, handling inconsistent measurements is possible. Recently, using confusion, we showed a way to compute the “average,” consensus, or centroid of a bag of assertions (made by observers) about a non-numeric property, such as John’s pet. The values of those assertions lie in a hierarchy. Intuitively, such a consensus minimizes the total discomfort of all observers (of the pet) when they learn which of the animals in the bag was selected as the consensus pet. The inconsistency of the bag is this total discomfort divided by the bag’s size. It is a number that tells how far apart the values of the bag are. It should be emphasized that a value asserted by an observer (such as Schnauzer in “the pet was a Schnauzer”) represents not only itself, but all the values from it up to the root of the hierarchy: Schnauzer, dog, mammal, animal, living creature. A bag of dissimilar assertions will have a large inconsistency, which can diminish if the problem at hand allows several centroids to be selected. John could have two pets, and the inconsistency of these two “consensus values” with all observations will then be much smaller: some of the observers will feel little discomfort with one of the centroids, and the remaining observers will feel little discomfort with the other. This chapter finds the set of centroids of a bag of qualitative values that minimizes the inconsistency of the bag; that is, the set for which the total discomfort of all members of the bag is smallest. These centroids define clusters, that is, subsets of the bag. All observers are assumed equally credible, so differences in their findings arise from perception errors and from the limited accuracy of their individual findings.
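The ideas in the abstract can be illustrated with a small brute-force sketch. The abstract does not define confusion, so the code assumes one common definition: the confusion of using value r in place of value s is the number of descending links on the hierarchy path from r to s, so that going up the hierarchy costs nothing. The toy hierarchy, the bag, and the names `PARENT`, `confusion`, and `best_centroids` are all hypothetical and chosen only for illustration; this is not the chapter's algorithm.

```python
from itertools import combinations

# Hypothetical toy hierarchy (child -> parent); not taken from the chapter.
PARENT = {
    "animal": "living creature",
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "Schnauzer": "dog",
    "Doberman": "dog",
    "Siamese": "cat",
    "parrot": "bird",
}


def ancestors(value):
    """Return the value followed by all its ancestors up to the root."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def confusion(used, intended):
    """Assumed definition: number of descending links on the path
    from `used` to `intended`; ascending links cost nothing."""
    reachable_by_ascent = set(ancestors(used))
    steps, node = 0, intended
    # Climb from `intended` until we meet the ancestor chain of `used`;
    # each climb corresponds to one descending link on the used -> intended path.
    while node not in reachable_by_ascent:
        node = PARENT[node]
        steps += 1
    return steps


def best_centroids(bag, k):
    """Exhaustively pick k centroids from the bag's distinct values that
    minimize total discomfort; return (centroids, inconsistency)."""
    candidates = sorted(set(bag))

    def total_discomfort(centroids):
        # Each observer is charged the confusion of the closest centroid.
        return sum(min(confusion(c, s) for c in centroids) for s in bag)

    best = min(combinations(candidates, k), key=total_discomfort)
    return best, total_discomfort(best) / len(bag)


# Hypothetical bag of six observations about John's pet(s).
BAG = ["Schnauzer", "Doberman", "dog", "Siamese", "Siamese", "cat"]

if __name__ == "__main__":
    for k in (1, 2):
        centroids, inconsistency = best_centroids(BAG, k)
        print(k, centroids, round(inconsistency, 3))
```

On this toy bag, and under the assumed definition of confusion, a single centroid gives an inconsistency of 5/6, while allowing two centroids (one dog breed and Siamese) lowers it to 1/6. The exhaustive search is only practical for small bags.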
Similar resources
A New Kernelized Fuzzy C-Means Clustering Algorithm with Enhanced Performance
Recently, the Kernelized Fuzzy C-Means clustering technique, in which a kernel-induced distance function is used as the similarity measure instead of the Euclidean distance used in the conventional Fuzzy C-Means clustering technique, has gained popularity in the research community. Like the conventional Fuzzy C-Means clustering technique, this technique also suffers from inconsistency in its performa...
Towards Finding a New Kernelized Fuzzy C-means Clustering Algorithm
The Kernelized Fuzzy C-Means clustering technique is an attempt to improve the performance of the conventional Fuzzy C-Means clustering technique. Recently, this technique, in which a kernel-induced distance function is used as the similarity measure instead of the Euclidean distance used in the conventional Fuzzy C-Means clustering technique, has gained popularity in the research community. Like th...
Graph based Text Document Clustering by Detecting Initial Centroids for k-Means
Document clustering is used in information retrieval to organize a large collection of text documents into meaningful clusters. The k-means clustering algorithm, which belongs to the partitional category, performs well on document clustering. k-means organizes a large collection of items into k clusters so that a criterion function is optimized. As it is sensitive to the initial values of cluster centroids, this...
Reducing Inconsistency in Integrating Data From Different Sources
One of the main problems in integrating databases into a common repository is the possible inconsistency of the values stored in them, i.e., the very same term may have different values, due to misspelling, a permuted word order, spelling variants and so on. In this paper, we present an automatic method for reducing inconsistency found in existing databases, and thus, improving data quality. Al...
Towards information-theoretic K-means clustering for image indexing
Information-theoretic K-means (Info-Kmeans) aims to cluster high-dimensional data, such as images featured by the bag-of-features (BOF) model, using K-means algorithm with KL-divergence as the distance. While research efforts along this line have shown promising results, a remaining challenge is to deal with the high sparsity of image data. Indeed, the centroids may contain many zero-value feat...